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Abstract. We provide a comprehensive overview of latent Markov (LM) 
models for the analysis of longitudinal categorical data. The main as- 
Cdj . sumption behind these models is that the response variables are condi- 

tionally independent given a latent process which follows a first-order 
Markov chain. We first illustrate the basic LM model in which the con- 
ditional distribution of each response variable given the corresponding 
latent variable and the initial and transition probabilities of the latent 
p_H ' process are unconstrained. For this model we also illustrate in detail 

C/3 ■ maximum likelihood estimation through the Expectation-Maximization 

^ c*j algorithm, which may be efficiently implemented by recursions known 

in the hidden Markov literature. We then illustrate several constrained 
versions of the basic LM model, which make the model more parsimo- 
nious and allow us to include and test hypotheses of interest. These 
constraints may be put on the conditional distribution of the response 
■ variables given the latent process (measurement model) or on the dis- 

^t" , tribution of the latent process (latent model). We also deal with ex- 

qq \ tensions of LM model for the inclusion of individual covariates and to 

£NJ ■ multilevel data. Covariates may affect the measurement or the latent 

fT} ■ model; we discuss the implications of these two different approaches 

CD I according to the context of application. Finally, we outline methods 

for obtaining standard errors for the parameter estimates, for select- 
ing the number of states and for path prediction. Models and related 
inference are illustrated by the description of relevant socio-economic 
^ ' applications available in the literature. 
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1. INTRODUCTION 

In many applications involving longitudinal data, the interest is often focused 
on the evolution of a latent characteristic of a group of individuals over time, 
which is measured by one or more occasion-specific response variables. This char- 
acteristic may correspond, for instance, to the quality-of-life of subjects suffering 
from a certain disease, which is indirectly assessed on the basis of responses to a 
set of suitably designed items that are repeatedly administered during a certain 
period of time. 

In the statistical and econometric literatures, several approaches have been 
introduced to address the above issue. Among these approaches, one of the most 
interesting is based on the same formulation of the hidden Markov model for 
time series (Baum and Petrie, 1966; MacDonald and Zucchini, 1997). The main 
assumption behind this approach is that the response variables are conditionally 
independent given a latent Markov chain with a finite number of states. The basic 
idea behind this assumption, that we will refer to as the assumption of local 
independence, is that the latent process fully explains the observable behavior 
of a subject; moreover, the latent state to which a subject belongs at a certain 
occasion only depends on the latent state at the previous occasion. 

The starting point of the paper is the latent Markov (LM) model. This model 
dates back to Wiggins' Ph.D. thesis, Wiggins (1955), who introduced a version 
of this model based on a homogenous Markov chain, a single outcome at each 
occasion, and did not account for individual covariates. Wiggins (1955) formu- 
lated the model so that a manifest transition is a mixture of a true change and a 
spurious change due to measurement errors in the observed states. See also Wig- 
gins (1973) for a deep illustration of this model and some simple generalizations. 
It is also worth mentioning Van de Pol and De Leeuw (1986), van de Pol and 
Langeheine (1990), Collins and Wugalter (1992), and Langeheine and Van de Pol 
(1994), among the first papers dealing with this model. 

In the following, we refer to the LM model based on a first-order Markov 
chain, non-homogeneous transition probabilities, and no covariates as the basic 
LM model. This model may be used for univariate or multivariate data; in the 
second case we observe more response variables at each occasion. For the ba- 
sic LM model we discuss in detail maximum likelihood estimation through the 
Expectation-Maximization (EM) algorithm (Baum et al., 1970; Dempster et al., 
1977), even though we acknowledge that other estimation methods are available 
(Archer and Titterington, 2009; Cappe et al., 1989; Kiinsch, 2005; Turner, 2008). 
For the implementation of the EM algorithm we illustrate suitable recursions 
which allow us to strongly reduce the computational effort. 

The paper also focuses on several constrained versions and extensions of the ba- 
sic LM model. Constraints have the aim of making the model more parsimonious 
and easier to interpret and correspond to certain hypotheses that may be inter- 
esting to test. These constraints may be posed on the measurement model, i.e. the 
conditional distribution of the response variables given the latent process, or on 
the latent model, i.e. the distribution of the latent process. About the measure- 
ment model, we discuss in detail Rasch (1961) type parameterizations which make 
the latent states interpretable in terms of ability or propensity levels. About the 
latent model, we outline several simplifications of the transition matrix, mostly 
based on constraints of equality between certain elements of this matrix and/or 
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on the constraint that some elements are equal to 0. One of the main problems 
is how to test for these restrictions. For this aim, we make use of the likelihood 
ratio (LR) statistic. It is important to note that, when constrains concern the 
transition matrix, the null asymptotic distribution does not necessarily have an 
asymptotic chi-squared distribution, but a distribution of chi-bar-squared type 
(Bartolucci, 2006). 

The most natural extension of the basic LM model is for the inclusion of in- 
dividual covariates. In particular, we describe two different ways of including 
such covariates: (i) in the measurement model, so that they affect the conditional 
distribution of the response variables given the latent process (Bartolucci and 
Farcomeni, 2009); (it) in the latent model, so that they affect the initial and the 
transition probabilities of the Markov chain (Vermunt et al., 1999; Bartolucci 
et al., 2007b). Further, we discuss methods to relax the assumption of local inde- 
pendence and show multilevel LM models which are suitable when subjects are 
collected in clusters. In this context, the model may be formulated by including 
fixed parameters to represent the effect of the clusters on the distribution of the 
latent process corresponding to every subject belonging to these clusters; this 
approach was followed by Bartolucci et al. (2009). Otherwise, the model may 
be formulated by including random parameters having a discrete distribution, 
following in this way an approach similar to that behind the latent class model, 
as in Bartolucci et al. (2010). This extension is related to the mixed LM model 
(van de Pol and Langeheine, 1990) and to the LM model with random effects 
(Altaian, 2007). 

Finally, we revise methods for obtaining standard errors for the model param- 
eters and for selecting the number of latent states. We also discuss the problem of 
path prediction through the Viterbi algorithm (Viterbi, 1967; Juang and Rabiner, 
1991). 

An important point to clarify is that, throughout the paper, we consider the 
case of categorical response variables because this is the typical case of application 
of the LM model. However, it is straightforward to modify the framework to deal 
with continuous outcomes. Most of the theory and estimation methods do not 
change substantially. 

The paper is organized as follows. In the Section 2 we outline the basic LM 
model and discuss maximum likelihood estimation for this model based on the EM 
algorithm. In Section 3 we outline constrained versions of the LM model based on 
parsimonious and interpretable parameterizations. In Section 4 we illustrate how 
to deal with individual covariates, whereas the multilevel extension is presented 
in Section 5. Section 6 deals with standard errors, selection of the number of 
states, and path prediction. Section 7 illustrates different types of LM model 
through various examples involving longitudinal categorical data, summarizing 
the results from other papers. The paper ends with a section where we draw 
main conclusions and discuss further developments of the present framework. 

2. BASIC LATENT MARKOV MODEL AND ITS MULTIVARIATE 

VERSION 

In the following, we illustrate the basic LM model for univariate categorical 
data without covariates and in which the latent Markov chain is of first-order and 
non-homogenous. We also describe the EM algorithm for maximum likelihood 
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estimation of this model. 

2.1 Basic formulation of univariate responses 

(1) (T) 

Let Yi = (Yj , Y> ), i = 1, ...,n, be a sequence of T categorical re- 
sponse variables with I levels or categories, coded from to I — 1, independently 
observed over n subjects. Typically, these variables have the same nature, as they 
correspond to repeated measurements on the same subjects at different occasions. 
However, the approach may be applied to the case of response variables having 
a different nature and, possibly, a different number of categories. 

The main assumption of the basic LM model is that of local independence, i.e. 
for every subject i the response variables in Yi are conditionally independent 

(1) (T) 

given a latent process Ui = (U> U> '). This latent process is assumed 

to follow a first-order Markov chain with state space {1, . . . ,k}. Then, for all 
t > 2, the latent variable is conditionally independent of ET^ 1 , . . . , llj^ 2 ^ 
given Xjf . See Figure 1 for an illustration via path diagram. 



y-(l) y(2) yp 



u? uf . ... . uf ] 



Fig 1 . Path diagram of the basic LM model for univariate data. 



The latent process can be used to model a real unidimensional latent trait 
evolving over time, or just as a parsimonious device to allow for a time non- 
homogeneous distribution for the sequence Yi . The Markov assumption is seldom 
found to be restrictive, and is easily interpretable. A further interpretation of the 
model is given by the measurement error framework: the unobservable outcome 
is observed with measurement error as Y^ . Therefore, it may be seen as an 
extension of a standard Markov model (Anderson, 1954); for a discussion on this 
point see (Wiggins, 1973, Chapter 4). 

Parameters of the model are the conditional response probabilities 4>y\ u = 
p(Y® = y\Ui = u), t = 1, . . . , T, u = 1, . . . , k, y = 0, — 1, the initial 
probabilities ir u = p(U^ = u), u = l,...,k, and the transition probabilities 

ttJA = p (Up = v\U~f~ l) = it), t = 2, . . . , T, u, v = 1, . . . , k. Note that all these 
probabilities do not depend on i since, in its basic version, the model does not 
account for individual covariates. 

On the basis of the above parameters, the distribution of U i may be expressed 

as 

(!) P( u i = u)= 7T u( i) YI n% | u( t-i)> 

t>i 

where u = {u^ l \ . . . , u^). Moreover, the conditional distribution of Y", given Ui 
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may be expressed as 

p(Yi = y\Ui = u) = n^V*)- 
t 

and, consequently, for the manifest distribution of Y i we have 
(2) f(y) = p(Y i = y) = Y J f(^y) = 



E,(l) ST J 2 ^ ( 2 ) \^^( T ) ( T ) 

VI) |uW ^f 1 ' ^y( 2 ) |wW ^(2) | U (2) ' 'y(T) \ U (T) ^ U (T) \ U (T-1) , 

uW «(2) U (T) 



where y = . . . , y (T) ) and f(u,y) = p{Yi = y\V \ = u)p(Ui = u). It 

is important to note that computing f(y) as expressed above involves a sum 
extended to all the possible k T configurations of the vector u; this typically 
requires a considerable computational effort. 

In order to efficiently compute the probability f(y), we can use a forward 
recursion (Baum et al., 1970) for obtaining q^(u,y) = p(U^ = u,Y^ = 
y«, . . . , y. (t) = for t = 1, . . . , T. We then have 

/(y) = E^ (T) ( u ^)- 

In particular, given y), u = 1, . . . , k, the t-th iteration of the recursion, 

t = 2, . . . , T, consists of computing 

(3) ffW („, ») = £ g^" 1 ) (n, y)7rW ^l )|u(t) , « = !,...,*, 

starting with <? (1) (/u,y) = 7r«^(i)| u - 

The above recursion may be simply implemented by using the matrix notation 
(Bartolucci et al., 2007b). Let q^\y) denote the column vector with elements 
q^(u,y), u = 1,. . . ,k, and in similar way define the initial probability vector 
7r with elements ir u and the conditional probability vector (j>y with elements 
d> i . Also let 11^ denote the transition probability matrix with elements ir^) , 
u, v = 1, . . . , k, with u running by row and v by column. We then have 

,., f diag[0vJ7r, if t = 1, 

\ diag[0^ ) ][nW]'q('- 1 )(y), otherwise, 

and at the end f(y) = q( T \y)'l, where 1 denotes a column vector of ones 
of suitable dimension. In implementing this recursion, attention must be payed 
to the case of large values of T, because during the recursion the probabilities 
q^\u,y) could become very small; see Scott (2002). 

2.2 Multivariate version 

In the multivariate case we observe a vector = (Y^' , . . . , Y^ ) of r re- 
sponse variables for every subject i and occasion t. For j = 1, . . . , r, the response 
f£l has lj levels coded from to lj 



variable Y^' has lj levels coded from to lj — 1. We also denote by Y^ the 
response vector made of the union of the vectors Y p for t = 1, . . . , T. 
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The LM model presented in Section 2.1 may be naturally formulated for mul- 
tivariate data by considering an extended version of the assumption of local in- 
dependence. Further to assuming that for every i the vectors Yf\ t = 1, . . . ,T, 
are conditionally independent given C/j, we assume that the response variables in 
each vector are conditionally independent given uf \ The resulting model 
is represented by the path diagram in Figure 2 



Y (T) ... Y (T) 

il ir 

- U iT) 
i 



Fig 2. Path diagram of the basic LM model for multivariate data. 




The model assumptions imply that 
(5) p{Y t = y\Ui =u) = X\p(Yf ] = = 



where y is made of the subvectors y^ = (yf \ ■ ■ ■ , y^) and 

(6) ^? ) =« w i^ ) =« w )=n^v*)' 

3 3 

with <f>jy\ u = p(Y^ = y\Uj = u). The manifest probability f(y) has then 
the same expression as in (2), with p{Yi = y\Ui = u) computed as in (5). 
This manifest probability may be computed by exploiting the same recursion as 
in (4), with 4>y(t) substituted by a vector of the same dimension with elements 

p (yf = y®\uV = «(*)), u = l,...,k. 

2.3 Maximum likelihood estimation 

For an observed sample of n subjects, let y i denote the (univariate or multi- 
variate) response configuration provided by subject i. The log- likelihood of the 
LM model may be expressed as 

(7) £(e) = J2io g [f( yi )}, 

i 

where is the vector of all model parameters arranged in a suitable way. An 
equivalent form is 

(8) £(9) = J2n y log[f(y)}, 

y 

where ny denotes the frequency of the response configuration y in the sample. In 
absence of covariates, using (8) for computing t{0) is more efficient since the sum 
J2y ma y be restricted to all response configurations y observed at least once; 
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we then adopt this formulation. We estimate 6 by maximizing the log-likelihood 
i(9). This may be easily done by the Expectation-Maximization (EM) algorithm 
(Baum et al., 1970; Dempster et al., 1977). 

The EM algorithm is based on the concept of complete data, which in our case 
consist of the response configuration y i (incomplete data) and the configuration 
of the latent process Ui (missing data) for every subject i. An equivalent way 
to represent the complete data, which is more coherent with (8), is by the fre- 
quencies m U y of the contingency table in which the subjects are cross-classified 
according to the latent configuration u and the response configuration y. Under 
this formulation, the complete data likelihood has logarithm 

r(0) = ^^m My log[/(w,y)]. 

u y 

After some simple algebra, in the univariate case we have 

0) m = EEE«S 1 °g[4i ) J + 

t u y 

u t>l U V 

where a« is the number of subjects that at occasion t are in latent state u; with 
reference to the same occasion t, aul is the number of subjects that move from 
latent state u to latent state v, and duy is the number of subjects that are in 
latent state u and respond by y. Note that £*(0) is made of three components 
that may be separately maximized; see Bartolucci (2006) for details. Also note 
that a simplification similar to (9) holds for the multivariate case. 

The frequencies ai , aul and duy above are obviously unknown. Then, the EM 
algorithm proceeds by alternating the following two steps until convergence in 
1(0): 

• E-step: it consists of computing the expected value of each unknown fre- 
quency in (9) given the observed data and the current value of the param- 
eters, so as to obtain the expected value of l*(0). The expected values of 
these frequencies are obtained as: 

4*> = £"yr (1) («lv). 
V 

a { ul = E n s/E r(t) (My), 

V t>l 

y 

where /(•) is the indicator function, r^(u\y) = p(U^ = u\Yi = y), and 
r®(u,v\y) =p(U- t ~ 1 ^ = UiUf 1 = v\Yi = y). 

• M-step: it consists of updating the estimate of by maximizing the ex- 
pected value of £*(0) obtained as above. Explicit solutions are available at 
this aim. In particular, we have: 
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Conditional probabilities of response: <j>® u = - t^t, t = 1, . . . , T, i 



1,... ,k, y = 0, ...,/- 1. 

— Initial probabilities: tt u = , u = 1, . . . , k. 

n 

a (t) 

— Transition probabilities: = . , , t = 2, . . . , T, u, u = 1, . . . , A;. 

A rule similar to the above one must be applied to update the parameters 4^j\\ u 
in the multivariate case. 

A final point is how to compute in an efficient way the posterior probabilities 
r^\u,v\y) on the basis of which we obtain the posterior probabilities r^(u\y) 
via marginalization. First of all consider the probabilities q*^ (u, y) = p{Y^ t+l ^ = 
y^ t+1 \ ■ ■ ■ , K = yi T )\U^ = u). For t = 1, . . . ,T — 1, these probabilities may be 
computed by the backward recursion 

y) = , y)^ ] ^X « = L ■ ■ ■ , *, 

V 

initialized with q*^ T \v,y) = 1, v = 1, . . . , k (Baum et al., 1970; Levinson et al., 
1983; MacDonald and Zucchini, 1997, Sec. 2.2). Then, for t = 2, . . . , T, we obtain 
the posterior probabilities above as 

q^(u,y)Tr {t U {t } t)l q* {t) (v,y) 

r v >(u,v\y) = — - , u,« = l,...,fe. 

/(») 

See Bartolucci and Besag (2002) for an alternative recursion to compute posterior 
probabilities. 

Even in this case, the recursion can be efficiently implemented by the ma- 
trix notation (Bartolucci et al., 2007b). Let q*^\y) be the vector with elements 
q*^(u, y), u = 1, . . . , k. This vector may computed as 

*(t) _ J lj if t = T, 

(10) q * (y) = | n^diag^+i^Q*^ 1 )^), otherwise. 

Then, the kxk matrix R^\y), with elements r®(u, v) arranged by letting u run 
by row and v by column, is obtained as 

diag[q( t - 1 )( 2/ )]nWdiag[^ ) ]diag[q*W(y)] 

(11) R^(y) = , t = 2,...,T. 

A similar implementation holds for the multivariate case by substituting each 
vector 4>ylt) with the vector having elements piY^ = y®\U- = u). 

As typically happens for latent variable and mixture models, the likelihood 
function may be multimodal. In particular, the EM algorithm could converge to 
a mode of the likelihood which does not correspond to the global maximum. In 
order to increase the chance of reaching the global maximum, the EM needs to be 
initialized in a proper way. Further, the value at convergence may be compared 
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with values obtained starting from initial values randomly chosen; for a similar 
multi-start strategy for mixture models see Berchtold (2004). Inference is then 
based on the solution corresponding to the highest value of the likelihood at 
convergence. We take this solution as the maximum likelihood estimates of 6 
donated by 9. 

3. CONSTRAINED VERSIONS OF THE LM MODEL 

In this section, we show how we can adopt a more parsimonious parameter- 
ization on the basis of constraints which typically correspond to hypotheses of 
interest. These constraints may concern: 

• measurement model: the distribution of the response variables given the 
latent process which depends on the conditional response probabilities <p^ u 

or dr. ; 

• latent model: the distribution of the latent process which depends on the 
initial probabilities 7r u and the transition probabilities Tt^) u - 

We also discuss maximum likelihood estimation and the likelihood ratio testing 
of constraints on the model parameters. 

3.1 Constraints on the measurement model 

3.1.1 Univariate case. With only one response variable at each occasion, a 
sensible constraint is that 

(12) <l>y\ u = <Py\u, t = l,...,T,u = l,...,k,y = 0,...,l-l. 

This constraint corresponds to the hypothesis that the distribution of the re- 
sponses only depends on the corresponding latent variable and so there is no 
dependence of this distribution on time. 

More sophisticated constraints may be formulated in the form 

(13) f|W = zW/3, 

where tj$ = g((f>^ u , • • • , 4*i—-\w)> with g(-) being a suitable link function, and Z$ 
a design matrix. With binary response variables, a natural link function is of logit 
type, so that 

s 

0\u 



or a probit link function based on the normal distribution function. With re- 
sponse variables having more than two categories, a natural choice is that of a 
multinomial logit link function, so that r/l has elements 

*® 

lo g-7tf' y = l,...,/-l. 

With ordinal response variables, more sensible logits are of global or continuation 
type; in the first case r]u has elements 

n(Y {t) > v\U {t) - v) 4> (t i +■■■ + 4j?\ 

log P[ * ^ VlU * j = log vlu y = 1 l-l 

rvit) . , rr (t) \ s At) , , At) ' y ± >-"' t 

p(Xi <y\ u i = u ) r \ u + ■ ■ ■ + r y U\ u 
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and in the second it has elements 

r>(Y {t) > v\U {t) - u) 4> { ? + • • • + <$\ 

p(Y^ = (y-l)\U^=u) 

see Colombi and Forcina (2001) and the references therein for a comprehensive 
review of these types of logit. Obviously, constraint (12) may be included into 

(13) by requiring Z$ = Z u , t = 1, . . . ,T, u = 1, . . . , k. However, using constraint 
(12) per se leads to a model which is easier to estimate and does not require to 
rely on a link function. 

An interesting example about parameterization (13) is given in the following 
example. 

Example 1 - LM Rasch model. This is a version of the Rasch (1961) model 
for binary data, based on the assumption that the ability evolves over time, which 
is formulated by assuming 

(14) %|« = log-^=£„-^ s t = l,...,s, u = l, 



a = i,...,c. 

0\u 



This formulation makes sense for data derived from the administration of a set 
of T test items to a group of n subjects, a situation that frequently arises in 
psychological and educational measurement. In this case, £ u may be interpreted 
as the ability level of subjects in latent state u, whereas ip® may be interpreted 
as the difficulty level of item t. Note that in this case we may have response 
variables of different nature, since they correspond to different items. So, strictly 
speaking, we are not in a longitudinal context in which the same response variable 
is repeatedly observed. Nevertheless, the model makes sense as an alternative to 
the latent class Rasch model (De Leeuw and Verhelst, 1986; Lindsay et al, 1991) 
and to test violation of the latter. For a detailed description see Bartolucci et al. 
(2008). Finally note that parametrizations similar to (14) may be adopted with 
ordinal variables on the basis of logits of global or continuation types. 

With binary responses, an advantage of parameterization (14) is that it implies 
the hypothesis of monotonicity, i.e. the latent states can always be ordered so that 
the probability of success increases with the label of the latent class, in symbols 

(15) <f>®^ ■■■<<!>%> t = l,...,T. 

Note that this constraint may be included in the model regardless of a specific 
parameterization as the one above. In particular, Bartolucci (2006) allows for 
inequality constraints in the general form K(3 0, where is a vector of zeros of 
suitable dimension, so that we can obtain ordered latent states in non-parametric 
way; see Bartolucci and Forcina (2005) and the references therein. 

3.1.2 Multivariate case In the multivariate case, the parameterizations sug- 
gested above may be adopted for each single response variable; then, the condi- 
tional distribution of each vector Y± given [/■ is still obtained as in (6). 

In particular, constraint (12) becomes 

( 16 ) 4>j, y \u = ^j'.i/ltf j = l,---,r, t = l,...,T, u = l,...,k, y = Q,...,lj -1. 
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Moreover, we can adopt a specific link function to parameterize the conditional 
distribution of each response variat 
so that we formulate the model as 



distribution of each response variable as in (13). We let ry^ u = 9h(4>j\\ u -> ■ ■ - > 



(17) tig, = Z®J3. 

3.2 Constraints on the latent model 

In absence of individual covariates, no many interesting constraints may be 
expressed on the initial probabilities ir u . The only constraint that may be of 
interest is that of uniform initial probabilities 

(18) 7r u = 1/k, u = l,...,k, 

meaning that at the beginning of the survey each latent state has the same 
proportion of subjects. 

More interesting constraints may be expressed on the transition probabilities. 
These constraints allow us to strongly reduce the number of parameters of the 
model. A simple constraint to express is that the Markov chain is time homoge- 
nous, that is 

(19) ^v\u = n v \u, t = 2,...,T, u,v = l,...,k. 

Note that we can also adopt a constraint of partial homogeneity; for instance, 
Bartolucci et al. (2007b) adopted two different transition matrices, one until 
occasion t and the other for transitions after this occasion, that is 



(20) 




t = 2,...,t, 
t = i+l,...,T, 



with i between 2 and T and u, v = 1, . . . , k. 

More sophisticated parametrizations may be formulated by a linear or a gener- 
alized linear model on the transition probabilities. Linear models have the advan- 
tage of permitting to express the constraint that certain probabilities are equal 
to 0, so that transition between two given states is not possible. In particular, 
Bartolucci (2006) considered a formulation that in our case may be expressed as 

(21) = W®6, 

where pu is the column vector containing the elements of the u-th row of the 
i-th transition matrix, apart from the diagonal element, i.e. ^) u , v = l,...,k, 

v ^ u, and wi is a suitable design matrix. In order to ensure that all the 
transition probabilities are non-negative, we have to impose suitable restrictions 
on the parameter vector S. Due to these restrictions, estimation may be more 
difficult and we are not in a standard inferential problem; see Section 3.4. 

Example 2 - Transition matrices that may be formulated by the linear 
model. The simplest constraint is that all the off-diagonal elements of the tran- 
sition matrix Tl® are equal to each other; with k = 3, for instance, if we also 
assume homogeneity, we have 

i _ 2< y(«) #*) <y(*) 

(22) nW = | $W 1-2$W ], t = 2,...,T. 

Sit) 5 (t) 1 _ 26 {t) 
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This constraint may be formulated by letting all design matrices Wu to be simply 
equal to 1. A less stringent constraint is that each transition matrix is symmetric, 
so that the probability of transition from latent state u to latent state v is the same 
as that of the reverse transition: 



8? $ 
8? i 8? 

V 8? 8f l-(4 4) +e 



Finally, when the latent states are ordered in a meaningful way by assuming, for 
instance, that (15) holds, it may be interesting to formulate the hypothesis that 
a subject in latent state u may move only to latent state v = u + 1, . . . , k. With 
k = 3, for instance, we have 

s? 8f 

n<*>= 1-8$ 8f 

V 1 

A first example of such restriction was provided by Collins and Wugalter (1992), 
in which latent states represent ordered developmental states. According to the 
underlying developmental psychology theory, children may make a transition to a 
next stage but will never return to a previous stage. 

An alternative parameterization is based on using a suitable link function for 
each row of the transition matrix. The model may be then formulated as 



with Ay = g u (ir^ u , . . . , 7r$ ) and V« being a suitable design matrix. A possible 
link function is based on logits with respect to the diagonal element, so that A« 



(23) A« = V®d, 

' k\u J 

bas 

has k — 1 equal to 

(24) log- lr' v = lj ■ ■ ■ ,kj v ^ u ' 

An alternative parameterization, which makes sense with ordered latent states, 
is based on global logits, so that the elements of each vector A« are 

(25) log^ v = 2,...,k. 

Many other link functions may be formulated, such as the same adopted within 
the ordered probit model. 

It is worth noting that we can combine a parametrization of type (23) with 
the constraint that certain transition probabilities are equal to 0. In this case the 
link function g u (-) must be applied to only those elements in the u-th row of the 
t-th. transition matrix that are not constrained to be equal to 0. In general, the 
size of each vector A^ is equal to the number of these elements minus 1 and 
the design matrices Vu in (23) need to be defined accordingly. The following 
example clarifies this case. 
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Example 3 - Tridiagonal transition matrix. A strong reduction of the num- 
ber of parameters may be achieved by assuming that each transition matrix is 
tridiagonal, so that transition from state u is only allowed to state v = u — 1, u-\- 1; 
with k = 4, for instance, we have 



(26) nW 



7T$ \ 

n i\2 ^2 n 3\2 

7T W 7T (t) 7T W 

U ^213 ^313 ^413 

V o o 7$ 7r«y 



T/iis constraint makes sense only if latent states are suitably ordered. In this case, 
it may combined with a parameterization based on logits of type (24) by assuming, 
for instance, 

-1 

= £u + ipv, t = 2, . . . ,T, u = 1, . . . ,v, 

71" V 
u\u 

where v = 2 for u = 1, v = k— 1 for u = k and v = u— l,u+l for u = 2, . . . , k— 1. 
This parameterization may be still expressed as in (23); note that in this case the 
vector containts only one logit for u = 1, k and 2 logits for u = 2, . . . , k — 1. 

3.3 Maximum likelihood estimation 

Under the constraints illustrated in Sections 3.1 and 3.2, maximum likelihood 
estimation of the parameters of the LM model is carried out by the EM algorithm 
illustrated in Section 2.3 in which we only have to modify the M-step according 
to the constraint of interest. In the following, we describe in detail the required 
adjustments. In any case, along the same lines as Shi et al. (2005), it is possible to 
prove that the observed log-likelihood i{0) at each EM iteration is not decreasing 
and, therefore, the algorithm converges to a local maximum of the log-likelihood. 

As usual, in order to increase the chance of converging to the global maximum, 
we recommend a multi-start strategy as that outlined at the end of Section 2.3. 

3.3.1 Constraints on the measurement model. Under constraint (12), the pa- 
rameters of the conditional distribution of the response variables given the latent 
process are updated as 



i>y\u = — t —jK, u = 1, k, y = 0, ...,l- 1. 



A similar rule must be applied in order to update the parameters 4>j )V \u of this 
distribution in the multivariate case, when constraint (16) is assumed. 

Under a parameterization of type (13), this conditional distribution depends 
on the parameter vector (3. This parameter vector is updated at the M-step by 
maximizing the corresponding component of the expected value of the complete 
data log- likelihood, i.e. 
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This may be done by standard iterative algorithms, such as the Fisher-scoring, 
which is based on the score vector and the expected information matrix corre- 
sponding to t\(($). These quantities have specific expressions depending on the 
adopted link function which, in turn, depends on the nature of the response 
variables. 

Consider, for instance, the case of binary response variables in which a logit 
link function is adopted, where the design matrix Z$ specializes into the row 
vector [z$]'. The score vector and the information matrix have, respectively, the 
following expressions: 



?m = EE^i<^>r. 



= EEEO*-*SW 



"o|uni« « 

t u 

Similar expressions may be obtained for the other cases; see Bartolucci (2006) for 
a general discussion which also concerns the use of constraints of type K(3 ^ 0. 
The same iterative algorithms may be used in the multivariate case, when a 
parameterization of type (17) is adopted. 

3.3.2 Constraints on the latent model. Under the constraint of uniform initial 
probabilities, the M-step simply skips the update of this parameters, whose values 
are fixed as in (18). 

About the transition probabilities, when the Markov chain is assumed to be 
time homogenous, see equation (19), these probabilities are updated as follows: 



1t v \ u — *7i— 1) ' u i v — !)•••)»• 



A similar rule may be applied in the partial homogeneity case to update the 
parameters and k^). For the parameters of the first type, the sum X^>i 111 
the equation above needs to be substituted by Ylt=2 f! f° r those of second type, 
this sum must be substituted by J2t=t+i t- 

More difficult is the case in which a linear or generalized linear parameteriza- 
tion is assumed on the transition probabilities. In order to update the parameters 
of these models, i.e. the vector 8 in (21) or (23), we have to update the corre- 
sponding component of the expected value of the complete data log-likelihood: 



*>i 



Even in this case, it is natural to apply iterative algorithms, such as the Fisher- 
scoring. 

For instance, when the multinomial logit parametrization based on logits of 
type (24) is assumed, we have 

m = EE^W-™^ 

t>l u 

= ^EEi^ndiagrf)-^^]'}^, 

t>\ u 
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where tvu is the column vector with k elements equal to 7rV , v = 1, . . . , k, and 

is the vector of the same dimension with elements a^v- in a similar way we 
can obtain the score vector and the expected information matrix for other cases, 
such as that based on global logits. For the case of a linear model on the transition 
probabilities, or parametrizations of the type illustrated in Example 2, we refer 
to Bartolucci (2006). 

3.4 Likelihood ratio testing 

As usual, in order to test a hypothesis Hq expressed through the constraints 
expressed above, we can use the likelihood ratio statistic 

D = -2[£(6) - £(e )], 

where l{0) is the maximum value of the likelihood of the unconstrained model 
and 1{0q) is that of the restricted model. 

When the usual regularity conditions hold, the null asymptotic distribution 
of the above test statistic is of chi-squared type with a number of degrees of 
freedom equal to the number of non-redundant constraints used to formulate Hq. 
The latter is equal to the difference in the number of non-redundant parameters 
between the two models that are compared. These regularity conditions hold 
for most of the constraints formulated in this section. For instance, in the case 
of binary response variables, through D we can test the hypothesis that the 
conditional probabilities of success follow a Rasch model; see equation (14). In 
this case, D has a chi-squared null asymptotic distribution with kT— [T+(k — l)] 
degrees of freedom, where T+(k — l) is the number of non-redundant parameters 
involved in (14). 

The main case where the usual regularity conditions do not hold is when Hq is 
formulated by a linear model on the transition probabilities, see (21), such that 
certain of these probabilities are constrained to be equal to zero. In this case, a 
boundary problem occurs (Self and Liang, 1987) and, as proved by Bartolucci 
(2006), the null asymptotic distribution is of chi-bar-squared type, i.e. a mixture 
of chi-squared distributions with weights which may be computed by explicit 
formulae or estimated by a simple Monte Carlo method; see Shapiro (1988) and 
Silvapulle and Sen (2004), Chapter 3. An interesting result is when we assume 
that the transition matrices depend on only one parameter as in (22). In this case, 
the hypothesis Hq : 5 = may be tested by the likelihood ratio D, whose null 
asymptotic distribution corresponding to a mixture between and a chi-squared 
distribution with one degree of freedom. The weights of this mixture are 0.5 and 
0.5. 

4. INCLUDING INDIVIDUAL COVARIATES AND RELAXING LOCAL 

INDEPENDENCE 

Individual covariates may be included in the measurement model or in the 
latent model. Note that the problem of relaxing local independence is strongly 
related to that of the inclusion of individual covariates in the measurement model, 
because both extensions rely on suitable parameterizations of the conditional 
distribution of the response variables given the latent process. Once a suitable 
parametrization has been adopted, local independence is relaxed by including the 
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lagged response variables, or a suitable transformation of these variables, among 
the individual covariates. 

Before illustrating the technical details, we need to clarify the notation. We 
denote by xf^ the vector of possibly time- varying individual covariates for subject 
i at occasion t, i = l,...,n, t = 1,...,T. As mentioned above, this vector 
may include the lagged response variables, i.e. yf ^ in the univariate case or 
yf ^ in the multivariate case, with even higher-order lags. The notation for the 
conditional response probabilities and initial and transition probabilities must 
be changed to take into account the covariates and the lagged responses. Then, 
for every i, we let 4>f^ u = p(Y^ t) = y\xf\ yf t = 1,...,T, u = l,...,k, 

y = 0, — 1; moreover, we let 7Tj jtt = p(U^ = u\xf^), u = 1, . . . ,k, and 
*?Mu = = v\uf~ l) = u,xf\yf- l) ), t = 2, . . . ,T, u,v = 1, . . . , k. In the 

multivariate case, we have the probabilities 0^i u = PO^ij = v\ x f\vf ^) which 
are defined, together with the probabilities nf\ u , on the basis of the lagged vector 

of response variables yf . Finally, the manifest probability of the response 
configuration y provided by subject i is denoted by fi(y) and corresponds to 
p(Yi = y\Xi), where X: t is the matrix collecting the vector of covariates xf \ 
t = 1, . . . , T, for subject i. 

Before describing in detail the parameterization of the probabilities on which 
the measurement and the latent models are based, an important point to note is 
that we can still compute in an efficient way the manifest probabilities fi(y) by 
using recursion (3). 

4.1 Covariates in the measurement model 

4-1.1 Univariate case. Individual covariates may be included in the measure- 
ment model through a parametrization which recalls that used in (13) to formu- 
late constraints on the conditional distribution of the response variables given 
the latent process. 

For every subject i, let rjf^ = g(i>f^ u , ■ ■ ■ , i>f}-i\ J> with ffQ being a link 
function of the type mentioned in Section 3.1. Then we assume 

(27) fig = Z^P, 

where zf^ is a design matrix depending on the covariates in _X~j. We recall that 
these covariates may include the lagged response variables when we want to relax 
local independence. 

A. 1.2 Multivariate case In the multivariate parametrization of type 

(27) can be adopted for each response variable, having in this way an expression 
that recalls that in (17) and based, in this case, on design matrices denoted by 

Zij^u- 

An LM model for multivariate data in which the conditional distribution of 
the response variables depends on the individual covariates was formulated by 
Bartolucci and Farcomeni (2009). In this formulation, the assumption that the 
response variables in are conditionally independent given U- is relaxed by 
assuming a marginal parameterization of the type described in Bartolucci et al. 
(2007a). This implies formulating a link function with respect to the column 
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vector pf^ having elements piY^ = y^\U^ = u,xf ) for all possible con- 
figurations of yW. In particular, the link functions adopted by Bartolucci and 
Farcomeni (2009) may be formulated as 

(28) V ?l = Clog[MpV], 

where C and M are matrices of simple construction. In particular, the vector rjf^ 
contains marginal logits and marginal log-odds ratios of different types (e.g. local, 
global, continuation) for the distribution p^fl- In this case, we have a unique design 
matrix Z^', having then an expression similar to (27), instead of a transition 

matrix zf) for each response variable. The following example clarifies the type 
of parametrization. 

Example 4 - Marginal model for two response variables. Consider the 
case of r = 2 variables with two and three levels (l\ = 2, I2 = 3, I3 = 3), which 
are treated with logits of type local and global, respectively. Overall, there are 3 
logits and 2 log-odds ratios. The logits may be parametrized as follows 

%l,Q\u 

At) At) 

^i2,0|« 



y i2,0|u ^ Vi2,l\u 



whereas for the log-odds ratios we have 

lQ P(Yg = l,*i2 > V\U? = u^pjY® = 0,Y l2 < y\U® = u,xf) = 
° S p{Y® = 0, Y l2 > y\U® = u, x®)p{Y® = 1, Y l2 > y\uf ] = u, xf) 2+ " 

with y = 1, 2. In this case, the parameter vector (3 is made of the subvectors /3 1; 
/3 2 and f3 3 , and the elements /J4 and f3§. 

4.2 Covariates in the latent model 

A natural way to allow the initial and transition probabilities of the latent 
Markov chain to depend on the individual covariates is by adopting a parametriza- 
tion which recalls that in (23). 

For the initial probabilities, in particular, we assume 

\i = Vidi, 



where A,; = g(iri t i, . . . ,7Tj fc | u ), Vj is a design matrix depending on the covariates 

in x\ and S\ is the corresponding parameter vector. Similarly, for the transition 
probabilities we have 

(29) Ag = V®5 2t 

imsart-sts ver. 2009/08/13 file: lm_reviewl5_submitted.tex date: March 16, 2010 



18 



F. BARTOLUCCI, A. FARCOMENI AND F. PENNONI 



with = <7 u (7iii| u , . . . ,vr i fc | u ), where the design matrix V i u depends on the 

covariates in xf 1 and 82 is the corresponding vector of parameters. In the above 
expressions, g(-) and g u (-), u = 1 . . . , k, are link functions that may be formu- 
lated on the basis on different types of logit. Typically, we use multinomial logits 
or global logits. As reference category, the multinomial logits have the first cat- 
egory when modeling the initial probabilities and category u when modeling the 
transition probabilities. In the latter case, an expression similar to (24) results. 
Global logits, based on an expression similar to (25) are used when the latent 
states are ordered on the basis of a suitable parametrization of the conditional 
distribution of the response variables given the latent process. 

Finally, note that we can combine a parametrization of type (29) with the 
constraint that certain transition probabilities are equal to as in (26). 

4.3 Interpretation of the resulting models 

We introduced two different schemes for including individual covariates in the 
model. In formulating an LM model, we suggest to adopt only one scheme, i.e. to 
include the covariates in the measurement model (under the constraints 7Tj >M = tt u , 
7T-^i = 7T 1 , i = 1, . . . , n) or in the latent model (under the constraint dr., = 

d>, or d>r) , = d>., , i = 1, . ...n). We advice against allowing the covriates to 
affect both the distribution of the latent process and the conditional distribution 
of the response variables given this process. In fact, the two extensions have a 
distinct interpretation. Moreover, the resulting LM model is in general of difficult 
interpretation and its estimation via EM algorithm is often cumbersome. 

About the model interpretation, we have to clarify that when covariates are 
included in the measurement model, the latent variables are seen as way to ac- 
count for the unobserved heterogeneity, i.e. the heterogeneity between subjects 
that we cannot explain on the basis of the observable covariates. The advantage 
with respect to a standard random effect or latent class model with covariates is 
that we admit that the effect of unobservable covariates has its own dynamics; 
for further discussions see Bartolucci and Farcomeni (2009). 

When the covariates are included in the latent model, we typically suppose 
that observeble outcomes indirectly measure a latent trait, such as the health 
condition of elderly people, which may evolve over time. In such a case, the main 
interest is in modeling the effect of covariates on the latent trait distribution; see 
Bartolucci et al. (2009). 

4.4 Maximum likelihood estimation 

In the presence of individual covariates, it is convenient to express the log- 
likelihood through a function similar to (7), that is 

£(0) = 5> g [/^)], 

i 

where fi(yi) is the manifest probability of the response configuration provided 
by subject i given the covariates in X.;. 

The likelihood function can be maximized by an EM algorithm having a struc- 
ture very similar to that outlined in Section 2.3. This algorithm is based on the 
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complete data log-likelihood that, in the univariate case, has expression 

(30) m = e{EEE^[€iJ + 

i t u y 

+ E^i°g(^)+EEE^^[-Sij}' 

u t>l u v ' 

where a^ u is a dummy variable equal to 1 if subject i is in latent state u at 

occasion t; with reference to the same occasion and the same subject, af uv is a 
dummy variable equal to 1 if this subject moves from state u to state v , whereas 
the dummy variable af}„. is equal to 1 if this subject is in state u and provides 
response y. In the multivariate case, we have a similar expression for this function, 
which depends on the probabilities d>.) , . 

The E-step of the algorithm consists of computing the conditional expected 
value of the above dummy variables given the observed data and the current 
value of the parameters. In practice, we compute 

-(*) (i) / i \ 

-(*) (*)/ i \ 

a i,uv = r i T«,%;), 

C/ = i(yf } = y)r {t \u\y t \ 

where r^(it|yj = p(U^' = u\Yi = and rf'^vlyj) = pillf*" 1 ^ = u^Jjf 1 = 
v\Y~i = Vi); the conditioning on the observable covariates in Xi is implicit in these 
expressions. The posterior probabilities may be computed by applying, subject 
by subject, the same recursion illustrated in Section 2.3; see in particular (10) 
and (11). 

The M-step consists of maximizing the complete data log-likelihood expressed 
as in (30), with each dummy variable substituted by the corresponding expected 
value. How to maximize this function depends on the specific formulation of the 
model and, in particular, on whether the covariates are included in the mea- 
surement model or in the latent model. We illustrate this point below, closely 
following the scheme adopted in Section (3.3). 

4-4-1 Covariates in the measurement model. In this case, the initial and tran- 
sition probabilities of the latent process are typically assumed to be equal across 
subjects (see Section 4.3). Then, the common parameters tt u and nty are updated 

as 

u = 1, . . . , k, 





Y-a [1) 


K u = 






n ' 


.(*) 


V a {t) 

i,uv 


•u u 


y.a {t ~ l) 



Ku, = — — TT^TT' t = 2,...,T, u,v = l,...,k. 



On the other hand, under one of the parameterizations illustrated in Section 
4.1, we have to update the parameter vector f3 by an iterative algorithm which 
maximizes 



W) = EEEE^tO 
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in the univariate case and a similar expression in the multivariate case. At this 
aim we suggest to use an iterative algorithm of Fisher-scoring type, the same 
suggested in Section 3.3.1. With binary response variables, in particular, for the 
score vector and information matrix we have the following expressions 

4-4-% Covariates in the latent model. When the covariates are included in the 
latent model, the conditional distribution of the response variables given the 
latent process is the same for all subjects, and depends on the parameters <j>y\ u in 

the univariate case and on the parameters dr. , in the multivariate case. In the 
first case, these parameters are updated as 

<4|l = ^J ' t = l,...,T,u = l,...,k,y = 0,...,l-l. 

a i,u 

A similar rule is adopted in the multivariate case. 

Iterative algorithms are instead required to update the parameters S\ and #2 
involved in the latent model, when this model is formulated as in Section 4.2. 
These parameters are updated by maximizing the functions 

3d) = EE a S lo gKn) 

i u 

to) = £££aS>^SJ 

t>l U V 

For the maximization of both functions we again suggest to use the Fisher-scoring 
algorithm, that may be implemented along the same lines as in Section 3.3.2. The 
score vectors and the information matrices exploited by this algorithm may be 
derived on the basis of standard rules. 

5. MULTILEVEL EXTENSION 

We now consider the extension to multilevel data in which subjects are collected 
into a given number of clusters and one or more random effects are used to model 
the influence of each cluster on the responses provided by the subjects who are 
included. The typical example is that of students collected in classes belonging 
to different schools. This extension is strongly related to that at basis of the 
mixed LM model (van de Pol and Langeheine, 1990; Langeheine and Van de Pol, 
1994; Kaplan, 2008) and to the extension for random effects proposed by Altman 
(2007). 

Even in this case, we may choose to include the random effects in the mea- 
surement model or in the latent model. In the following, we illustrate the model 
based on second choice, which we consider more interesting, closely following 
what proposed by Bartolucci et al. (2010). 

In illustrating the multilevel extension, we denote by H the number of clus- 
ters. Every subject in the sample is identified by the pair of indices hi, with 
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h = 1,...,H and i = 1, . . . , nh, and where n/i is the dimension of cluster h. 
Accordingly, we denote the response of this subject at occasion t by Y^f , the 
vector of all responses provided by this subject by Y^i, and the collection of all 
responses provided by the subjects in cluster h by Yh- Note that in the multi- 
variate case we have a vector of responses Y^J = (yJ^, . . . ,Y^y.), instead of the 
single variable Y® , where each variable Y^j has lj levels, from to lj — 1. 

5.1 Model assumptions 

(1) (T) 

As usual, the LM model relies on a latent Markov chain Uhi = (E/L , • • • , UL ) 
for every subject hi. However, the assumption that these chains are indepen- 
dent, as in the basic model, is relaxed by assuming that, for every cluster h, 
Ufa, . . . , Uhn h are conditionally independent given the latent variable Wh- The 
latter has the role of capturing the heterogeneity between clusters and, for ev- 
ery h, is assumed to have a discrete distribution with support {1, . . . ,m}. The 
assumption of local independence is retained and formulated by requiring that 
each response variable Y$ is conditionally independent of any other variable in 
the model (including the response variables associated to any other subject and 
the latent variables at cluster level), given the corresponding latent state U^J. 
All these assumptions are represented by the path diagram in Figure 3. 



,(D Y Q) ■■■ Y (T) 




Fig 3. Path diagram for the multilevel LM model. 



The above assumptions imply that, for every h, the response vectors Y^i, ■ ■ ■ , Y ' h nh 
are not independent, but they are conditionally independent given the latent vari- 
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able Wh- Marginal independence holds between the vectors of response variables 
Y±, . . . , Y h associated to the different clusters. 

Model parameters are still the conditional response probabilities and the initial 
and transition probabilities of the latent process. The parameters of the first type 
are still denoted by 4>y\ u i t = 1? • • • > T, u = 1, . . . , k, y = 0, — 1, and corre- 
spond to p{Yfa = y\u!> = u); a similar notation is adopted in the multivariate 
case. An extended notation, which takes into account the multilevel structure, is 
instead necessary for the other parameters. In particular, for w = 1, ... ,m, we 
have tt u \ w = p(U$ = u\W h = w), u = 1, . . . , k, and tt^ uw = p{uf ] = v\uf~ l) = 
u, Wh = w), t = 2, . . . , T, u, v = 1, . . . , k, w = 1, . . . , m. These parameters corre- 
spond, respectively, to the initial probabilities and the transition probabilities of 
the Markov chain associated to a subject belonging to a cluster with latent vari- 
able at level w. Finally, we also have to include the parameters for the distribution 
of each variable Wh, and then we let p w = piWh = w), w = 1, . . . , m. 

Under the model assumptions, we can express, for each cluster h, the manifest 
distribution of the vector Y \ as 

(31) f h (y) = p(Y h = y) = J2p( Y h = v\Wh = w) Pw , 

w 

where y has now the same dimension as Yh and is made of the subvectors of 
the same dimension as Y hi, i = 1, . . . , n^. Moreover, we have that 



w 



P (Y h = y\W h = w) = Y[p(Y hi = Vi \W h 

i 

where 

p(Y M = y l \W h = w) = ^0 y m lum * u mJ20% lu v) n %\um 

uW 1 u( 2 ) 1 



E 



„( T )| U (T)%(T)MT-1)- 

This probability may be efficiently computed through recursion (3). 

5.2 Extension to include cluster and individual covariates and to the 
multivariate case 

In the context of multilevel data, covariates are usually available at cluster 
and individual levels. Let Zh denote the column vector of available covariates for 
cluster h, let x,^ be the column vector of available covariates for subject hi at 
occasion t, h = 1, . . . , H, i = 1, . . . , rih, t = 1, . . . , T. These covariates are included 
in the model in a way similar to that illustrated in Section 4.2. Then we extend 
the notation by letting, for each subject hi, ^hi,u\w = Pi^hi = u \Wh = ^'^i^)) 
u = 1, . . . , k, and 7$,,,^ = piU? = v\ut 1] = u,W h = w, x[% t = 2, . . . , T, 
u, v = 1, . . . , k, w = 1, . . . , m; we also let ph, w = p(Wh = w\zh), w = 1, . . . , m, 
for each cluster h. 

Covariates at cluster level are included into the model by adopting a suitable 
parametrization for the probabilities Ph,w A natural choice is the multinomial 
logit parametrization, and then we assume 

(32) log = 70™ + z' h -f lw , w = 2,...,m, 

Ph,l 
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where 7 12 , . . . , ~fi m are vectors of regression coefficients of the same dimension as 
Zh, and 702, • • • ,7om denote the corresponding intercepts. 

A multinomial logit parameterization may also be adopted for the initial and 
transition probabilities of the Markov chain associated to every subject hi. How- 
ever, if the conditional distribution of the response variables given the latent 
process is formulated so that the latent states are ordered, a considerable re- 
duction of parameters is achieved by adopting a parametrizion based on global 
logits. This approach is exploited by Bartolucci et al. (2010), who relied on a 
Rasch parametrization for the distribution of each response variable given the 
corresponding latent state. Then, for the initial probabilities of every Markov 
chain they assume 

log = d 0w + 5 lu + {x hi ') d 2 , u = 1, . . . , k, w = 2, . . . ,m, 

^hi,l\w "T ' ' ' T" ^hi,u—l\w 

(33) 

where 62 is a vector of regression parameters of the same dimension as x^j which 
is common to every level u. Moreover, the intercepts 5o w depend on the level 
of Wh (with Sot = 0), and the intercepts 8\ u depend on the level of Ifj^ . We 
impose 5%2 ^ • • • ^ Si m in order to ensure the invertibility of the global logit 
parametrization. 

Finally, for what concerns the transition probabilities we have 

7T (t) , +---+7T (i) 
/n*\ 1 hi,v+l\uw hi,k\uw , , (t)\/ 

(34) log — = r] w + Viuv + {x\l ) rj2 . 

ftl 11 + ' ' ' + TTfc- 11 

ni,l\uw ni,v— l\uw 

with u = 1, . . . , k, v = 1, . . . , ki — 1, w = 1, . . . , m, and t = 2, . . . ,T. As above, r] 2 
is a vector of regression coefficients for the individual covariates, the intercepts 
t]q w depend on the level of Wh (with 7701 = 0), and the intercepts r]Q UV depend 
on the levels of ^ and and must be decreasing ordered in v. These 
parameters may also be allowed to depend on the time occasion; for instance, we 
can have 77^ in place of r]Q V . 

Finally note the extension of the model presented to the multivariate case, 
where we observe for every hi and t a vector of responses Y^], is straightfor- 
ward. The basic assumption to be added is that the variables in this vector are 
conditionally independent given the corresponding latent state U$ . Bartolucci 
et al. (2010) also addressed this case, allowing for different sets of responses ob- 
served at each occasion. This assumption of conditional independence may also be 
relaxed by adopting a parametrization of type (28), including the lagged response 
variables as already discussed in Section 4. 

5.3 Maximum likelihood estimation 

Given an observed sample, the log-likelihood of the model illustrated above is 
given by 

*(0)=2>g[/ h (y h )], 

h 

where y h is the vector containing the responses observed for all subjects in cluster 
h and the manifest probability /(2//J is defined in (31). 
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The EM algorithm is again the mail tool we suggest to maximize £(0). However, 
its implementation is more difficult with respect to the versions already illustrated 
for the presence of latent variables at two levels. First of all, for the case of 
univariate responses and considering the presence of covariates, the complete 
data log-likelihood may be expressed as 



(35) t{6) = ^r^2b h<w log( Ph , w ) + 

h w 

+ EE{E«S>s[^iJ + 



h i K u 

+ E E E a hluv 1o ^m Muw ] + 



U V t>l 



+ EEE4i,io g [^j}, 



u t y 

where the dummy variables are defined as in Section 3.3 and is a dummy 
variable equal to 1 if cluster h belongs to latent class w (i.e. Wh = w). 

At the E-step, we compute the conditional expected value of the above dummy 
variables given the observed data and the current value of the parameters; at 
the M-step, we maximize the conditional expected value of £*(6) obtained by 
substituting each dummy variable in (35) with the corresponding expected value 
obtained from the E-step. For details on the implementation of these steps we 
refer to Bartolucci et al. (2010). 

6. STANDARD ERRORS, MODEL SELECTION, AND PATH 

PREDICTION 

In this section we show how to obtain standard errors for the model parameters, 
dealing in particular with the information matrix. We also outline the problem 
of model selection for what mainly concerns the choice of the number of latent 
states. Finally, we outline the problem of path prediction, i.e. how to find the 
maximum a posteriori sequence of latent states for a given subject. 

6.1 Standard errors 

It is well known that the EM algorithm, differently from other algorithms 
such as the Newton- Raphson or the Fisher-scoring, does not directly produce the 
information matrix of the model. This matrix is typically used for computing 
standard errors. 

Several methods have been proposed to overcome this difficulty. Most of these 
methods have been developed within the literature on hidden Markov models; 
for a concise review see Lystig and Hughes (2002). The more interesting methods 
are based on the information matrix obtained from the EM algorithm by the 
technique of Louis (1982) or related techniques; see for instance Turner et al. 
(1998) and Bartolucci and Farcomeni (2009). Other interesting methods obtain 
the information matrix on the basis of the second derivative of the manifest 
probability of the response variables by a recursion similar to (3); see Lystig and 
Hughes (2002) and Bartolucci (2006). 

Among the method related to the EM algorithm, that proposed by Bartolucci 
and Farcomeni (2009) is very simple to implement and requires a small extra 
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code over that required for the maximum likelihood estimation. The method 
exploits a well-known result according to which the score of the complete data 
log-likelihood computed at the E-step of this algorithm corresponds to the score 
of the incomplete data log-likelihood. More precisely, we have 



dl{0) dE- [t(e)\Y] 

s(U) - 



89 89 



0=0 



where Eg[£*(0)|"K] denotes the conditional expected value of the complete-data 
log-likelihood computed at the parameter value 9. Then, the observed information 
matrix, denoted by J(9), is obtained as minus the numerical derivative of s(9). 
The standard error of each parameter estimate is then obtained as the square root 
of the corresponding diagonal element of J(^)^ 1 . These standard errors may be 
used for hypothesis testing and for obtaining confidence intervals in the usual 
way. 

The information matrix can also be used for checking local identifiability at 9. 
We consider the model to be local identifiable if the matrix J{9) is of full rank. 
See also McHugh (1956) and Goodman (1974). 

6.2 Model Choice 

In applying a LM model, a fundamental problem is that of the choice of the 
number of latent states, denoted by k. For multilevel versions of this model, it is 
also necessary to choose the number of support points for the latent variables at 
cluster level, denoted by m. 

In order to choose the above quantities, it is natural to use information crite- 
ria such as the Akaike Information Criterion (AIC), see Akaike (1973), and the 
Bayesian Information Criterion (BIC), see Schwarz (1978). According to first cri- 
terion, we choose the number of states corresponding to the minimum of AIC = 
—2£(9)+2g, where g is the number of non-redundant parameters; according to the 
second, we choose the model with the smallest value of BIC = —2l{9) + g\og{n). 

The performance of the two approaches above have been deeply studied in the 
literature on mixture models; see McLachlan and Peel (2000), Chapter 6. These 
criteria have also been studied in the hidden Markov literature for time series, 
where the two indices above are penalized with a term depending on the number of 
time occasions; see Boucheron and Gassiat (2007). From these studies, it emerges 
that BIC is usually preferable to AIC, as the latter tends to overestimate the 
number of latent states. 

The theoretical properties of AIC and BIC applied to the LM models are less 
studied. However, BIC is a commonly accepted model choice criterion even for 
these models. It has been applied by many authors, such as Langeheine (1994), 
Langeheine and Van de Pol (1994), and Magidson and Vermunt (2001). In particu- 
lar, Bartolucci et al. (2009) suggested the use of this criterion together with that of 
diagnostic statistics measuring the goodness-of-fit and goodness-of-classification, 
whereas a simulation study may be found in Bartolucci and Farcomeni (2009). 

6.3 Path prediction 

Once the model has been estimated, a relevant issue is that of path prediction, 
i.e. finding the most likely sequence of latent states for a given subject on the 
basis of the responses he/she provided. For the ith subject in the sample, this is 
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the sequence iii = {u\ , . . . ,u\ ) such that 

p{u\Vi) = maxpOlyJ, 
u 

where we use the notation of Section 2. This definition may be simply adapted 
to more general cases involving individual covariates and/or multilevel data. 

The problem above is different from that of finding the most likely state oc- 
cupied by a subject at certain occasion. This problem may be simply solved on 
the basis of the posterior probabilities of type r^\u\y), whose computation is 
required within the EM algorithm; see Section 2.3. In any case, we can use an 
efficient algorithm which avoid the evaluation of the posterior probability p(u\y) 
for every configuration u of the latent process. This is known as Viterbi algorithm 
(Viterbi, 1967; Juang and Rabiner, 1991), and is illustrated in the following. 

For a given subject i with response configuration y i = (y- , . . . , y\ ), let 
ff \u) = p{ul l) = u, yf ] ) and, for t = 2, . . . ,T, let 

ff (u) = max p{ui 1] =^\...,ut 1] =v (t ~ 1 \U^ =u,y^,...,yf ) ). 

ItC 1 ),...,!^* - 1 ) 

A forward recursion can be used to compute the above quantities, and a backward 
recursion based on these quantities can then be used for path prediction: 

1. for u = 1, . . . , k compute ff\i) as it u p(yf~^ \uf~ = u); 

2. for t = 2, . . . , T and v = 1, . . . , k compute ff' (v) as 

p(y®\U®=v) nmxfrf-V)^)]; 

3. find the optimal state as = arg max u (u) ; 

4. for t = T — 1, . . . , 1, find up as uf = argmax u ff^ftiW .rt+i). 

All the above quantities are computed on the basis of the ML estimate of the 
parameter 6 of the model of interest. 

7. EMPIRICAL ILLUSTRATIONS 

In order to illustrate the approaches reviewed in this paper, we provide a 
synthetic overview of some interesting applications appeared in the literature. 
We describe different datasets, the aspects involved in practically fitting and 
using the LM models, and the interpretation of the results. 

7.1 Marijuana Consumption dataset 

The univariate version of the LM model described in Section 2.1 was applied 
by Bartolucci (2006) to analyze a marijuana consumption dataset based on five 
annual waves of the "National Youth Survey" (Elliot et al., 1989). The dataset 
concerns n = 237 respondents who were aged 13 years in 1976. The use of mari- 
juana was measured through five ordinal variables, one for each annual wave, with 
three categories corresponding to: "never in the past", "no more than once in a 
month in the past year" , and "once a month in the past year" . The substantive 
research question is whether there is an increase of marijuana use with age. 

Using BIC, Bartolucci (2006) selected an LM model with three latent states, 
homogeneous transition probabilities, and a parsimonious parameterization for 
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the measurement model based on global logits, which recalls that in (14). This 
parametrization is based on one parameter for each latent state, which may be 
interpreted as the tendency to use marijuana for a subject in this state, and 
one cutpoint for each response category. Then, the latent states may be ordered 
representing subjects with "no tendency to use marijuana", "incidental users of 
marijuana", and "subjects with high tendency to use marijuana". Also note that 
the cutpoints are common to all the response variables, since these variables 
correspond to repeated measurements of the same phenomenon under the same 
circumstances. This is because the dynamics of the marijuana consumption is 
only ascribed to the evolution of the underlying tendency of this consumption. 

Bartolucci (2006) also tested different hypotheses on the transition matrix of 
the latent process. In particular, he found that the hypothesis that the transition 
matrix has a tridiagonal structure, i.e. 



cannot be rejected. This hypothesis implies that the transition from state u to 
latent state v is only possible when v = u — 1 or v = u + 1. 

The results under the selected LM model say that, at the beginning of the pe- 
riod of observation, the 89.6% of the sample is in the first class (lowest tendency 
to marijuana consumption) and the 1.5% is in the third class (highest tendency 
to marijuana consumption). An interesting interpretation of the pattern of con- 
sumption emerges from the estimated transition matrix. A large percentage of 
subjects remains in the same latent class, but almost 25% of accidental users 
switches to the class of high frequency users. From the estimated marginal prob- 
abilities of the latent classes emerge that the tendency to use marijuana increases 
with age, since the probability of the third class increases across time. 

7.2 Educational dataset 

An interesting illustration of an LM model with individual covariates was given 
by Vermunt et al. (1999). They used data from an educational panel study con- 
ducted by the "Institute for Science Education in Kiel (Germany)" (Hoffmann 
et al., 1985). A cohort of secondary school pupils was interviewed once a year from 
grade 7 to grade 9 with respect to their interests in physics as a school subject. 
The response variables have been dichotomized with categories "low" and "high" 
to avoid sparseness of the observed frequency table. Based on these data, the LM 
model is used to draw conclusions on whether interest in physics depends on the 
interest in the previous period of observation and on two available covariates: sex 
and grade in physics at the present time. 

Vermunt et al. (1999) estimated a univariate LM model with both initial and 
transition probabilities of the latent process depending on the available covariates 
according to a multinomial logit parametrization. In this model, the measurement 
error was constrained to be the same for all time points, meaning that the con- 
ditional distribution of the response variables given the latent state is the same 
for every occasion. Then, they relaxed some of the basic assumptions of the LM 
model, such as the assumption that the Markov chain is first-order. 

According to the parameter estimates of the selected model, there is a sig- 
nificant effect of sex and grade on the interest in physics. Pupils with higher 
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grades are more interested in physics than pupils with lower grades, girls are less 
interested in physics than boys. Moreover, the interest has a positive effect on 
the grade at the next time occasion. For the boys, the probability of switching 
from "low" to "high interest" is larger than that for girls, as well as to keep their 
interest high. 

7.3 Criminal dataset 

The multivariate version of the LM model where both the initial and the tran- 
sition probabilities of the latent process depend on time-constant covariates was 
illustrated by Bartolucci et al. (2007b). They analyzed the conviction histories of 
a cohort of offenders who were born in England and Wales in 1953. The offend- 
ers were followed from the age of criminal responsibility, 10 years, until the end 
of 1993. They were grouped in 10 major categories and gender was included in 
the model as explanatory variable. The analysis was based on T = 6 age bands: 
10-15, 16-20, 21-25, 26-30, 31-35 and 36-40 years. 

The adopted LM model allows to estimate trajectories for behavioral types 
which are determined by the criminal conviction grouping. It also allows to give 
rise to a general population sample by augmenting the observed sample with 
not-convicted subjects. 

According to Bartolucci et al. (2007b), the fit of the model is considerably 
improved by relaxing the assumption of homogeneity of the latent Markov chain, 
but retaining the constraint that males and females have the same transition 
probabilities. In particular, they selected a model based on partially homogeneity, 
as described in Section 3.2; equation (20). Therefore there are two transition 
probability matrices: the first for transitions up to time i and the second from 
time t and beyond. The choice of t = 2 has been made on the basis of the BIC. 

In summary, the selected model is based on a partially homogeneous Markov 
chain with five latent states, different initial and equal transition probabilities for 
males and females. From the estimated conditional probabilities of conviction for 
any offence group and any latent state, it was possible to determine classes of 
criminal activity. In accordance to the typologies found in the criminological la- 
tent class literature, these classes are interpreted as: "non-offenders", "incidental 
offenders", "violent offenders", "theft and fraud offenders" and "high frequency 
and varied offenders" . 

The estimated initial probabilities show that in the first age band the per- 
centage of males who are incidental offenders is higher than that of females. The 
common estimated transition probabilities for males and females from age band 
10-15 to age band 16-20, and from one age band to the others for offenders over 
16, show that the first transition occurs at an early age, 16 years, which in western 
society represents the peak of the age-crime curve. 

At the first time occasions, "incidental offenders" have a quite high probability 
of persistence when moving from the age band 10-15 to age band 16-20. Moreover, 
"theft and fraud offenders" are mainly females and they have a high chance of 
moving to the class of non-offenders. The "high frequency and varied offenders" 
are mainly males and they have a high persistence. 

The estimated transition probabilities from age band 16-20 to the subsequent 
age bands show that the subjects belonging to the latent state of "non-offenders" 
have a very low chance of becoming offenders; "theft and fraud offenders" and 
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"violent offenders" have a high probability of dropping out of crime. Prom the 
estimated proportion of males and females in each latent state at every time 
occasion it can be seen that 7% of males are "violent offenders" at age 16-20 
years and 32% are "incidental offenders" at the same age. Only 3% of females 
are "theft and fraud offenders" at age 16-20 years. 

7.4 Dataset on financial products preferences 

An interesting analysis of data obtained from face to face interviews of the 
household ownership of 12 financial products was offered by Paas et al. (2009). 
The panel was conducted by a market research company among 7676 Dutch 
households in 1996, 1998, 2000 and 2002. To have an accurate representation 
of the products portfolio, the households were asked to retrieve their bank and 
insurance records in order to check which product they owned. Households that 
dropped out were replaced to ensure the representativeness of the sample for 
the population with respect to demographic variables, such as age, income and 
marital status. 

The aim of the study was to get insights on the developments of the individual 
household product portfolio and the effect of demographic covariates on such 
development. It also concentrates on predicting future behaviors of acquisition. 

The authors proposed to use a time homogeneous multivariate LM model, with 
time-varyng covariates affecting the latent process as in Section 4.2. They added 
additional assumptions to the model, such as constant conditional probabilities 
of the response variables given the latent process. This is done to avoid manifest 
changes, so that the product penetration levels are consistent in latent states 
over measurement occasions. Moreover, they formulated the model with a time- 
constant effect of the covariates on the transition probabilities. 

The model selected on the basis of BIC is based on nine latent states. These 
states can be ordered according to increasing penetration levels across the ana- 
lyzed products, which range from bonds, the most commonly owned product, to 
saving accounts. The results highlight some divergences from common order of 
acquisitions, such as the acquisition of a mortgage before owing a credit card or 
viceversa. Loans and unemployment insurance are most often acquired. Accord- 
ing to the estimated transition matrix there is a high persistence in the same 
latent state: only 14% of the households changed latent state in the period of the 
study. The most common switch is from latent class 7 to 8, where latent state 
7 is characterized by the acquisition of mortgage, life insurance, pension fund, 
car insurance, and saving accounts, whereas latent state 8 for all the previous 
products plus the credit card. Another common switch is from latent state 4 to 
7, where the first is characterized by the acquisition of life insurance, pension 
fund, car insurance, and savings accounts. This means that multiple products 
were acquired between consecutive measurement occasions. 

Income, age of the head of the household, and household size have a significant 
effect on the initial and transition probabilities according to the Wald test. The 
covariate values implying a larger probability of belonging to an initial latent 
class also imply a greater probability of switching into the same latent class. For 
example larger households are relatively often found in latent states where overall 
product ownership probabilities are relatively low. 

The prediction of future purchase of a financial product was performed on the 
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basis the posterior latent state membership probabilities for each household at the 
last occasion T, given all other observed information. To assess the accuracy of the 
forecasting, the authors used the Gini coefficient as a measure of concentration. 
Considering the empirical results in the last wave referred to year 2002, which 
was not considered when estimating the model, the authors showed that, for 
most products, the prediction equations are effective for forecasting household 
acquisition. 

7.5 Job position dataset 

The multivariate LM model with covariates affecting the manifest probabilities 
proposed by Bartolucci and Farcomeni (2009) was applied by these authors to 
data extracted from the "Panel Study of Income Dynamics" database (University 
of Michigan). These data concern n = 1446 women who were followed from 1987 
to 1993. The binary response variables are fertility, indicating whether a woman 
had given birth to a child in a certain year and employment, indicating whether 
she was employed. The covariates are: race (dummy variable equal to 1 for a 
black woman), age (in 1986), education (year of schooling), child 1-2 (number of 
children in the family aged between 1 and 2 years, referred to the previous year), 
child 3-5, child 6-13, child 14-, income of the husband (in dollars, referred to the 
previous year). 

The main issue concerns the direct effect of fertility on employment. Also of 
interest are the strength of the state dependence effect for both response variables 
and how these variables depend on the covariates. 

Bartolucci and Farcomeni (2009) used an LM model with covariates affecting 
the manifest probabilities since they were interested in separately estimating the 
effect of each covariate on each outcome. The proposed LM model allows to 
separate these effects from the unobserved heterogeneity, by modeling the latter 
with a latent Markov process. In this way, unobserved heterogeneity effects on 
the response variables are allowed to be time-varying; this is not allowed neither 
within a LC model with covariates nor in the most common random effect models. 

The model selected using AIC and BIC is with three latent states. Under this 
model, race has a significant effect on fertility, but not on employment according 
to the estimates of the parameters affecting the marginal logits of fertility and 
employment and the log-odds ratio between these variables. Age has a stronger 
effect on fertility than on employment. Education has a significant effect on both 
fertility and employment, whereas the number of children in the family strongly 
affects only the first response variable and income of the husband strongly affects 
only the second one. 

The log-odds ratio between the two response variables, given the latent state, 
is negative and highly significant, meaning that the response variables are neg- 
atively associated when referred to the same year. On the other hand, lagged 
fertility has a significant negative effect on both response variables and lagged 
employment has a significant effect, which is positive, on both response variables. 
Therefore, fertility has a negative effect on the probability of having a job posi- 
tion in the same year of the birth, and the following one. Employment is serially 
positively correlated (as consequence of the state dependence effect) and fertility 
is negatively serially correlated. 

From the estimates of the support points for each latent state it may be deduced 
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that the latent states correspond to different levels of propensity to give birth to 
a child and to have a job position. The first latent state corresponds to subjects 
with the highest propensity to fertility and the lowest propensity to have a job 
position. On the contrary, the third latent state corresponds to subjects with the 
lowest propensity to fertility and the highest propensity to have a job position. 
Finally, the second state is associated to intermediate levels of both propensities. 
The two propensities are negatively correlated. 

Overall, it results that the 78.5% of women started and persisted in the same 
latent state for the entire period, whereas for the 21.5% of women had one or more 
transitions between states. The presence of these transitions is in accordance to 
the rejection of the hypothesis that a LC model is suitable for these data. 

7.6 Dataset on anorectic patients 

An interesting extension of the model to account for a hierarchical structure has 
been recently proposed by Rijmen et al. (2007). They illustrated the model by a 
novel application using a data set from an ecological momentary assessment study 
(Vansteelandt et al., 2007) on the course of emotions among anorectic patients. At 
nine occasions for each of the seven days of observations, 32 females with eating 
disorders received a signal and were asked to rate themselves on a 7-point scale 
with respect to the intensity with which they experienced 12 emotional states. 
These were taken from the following emotional categories: "anger and irritation", 
"shame and guilt", "anxiety and tension", "sadness and loneliness", "happiness 
and joy", "love and appreciation". The response has been dichotomized (0-2 vs. 
3-6) and the signal has been considered equally spaced. The aim of the study was 
to detect the course of emotion among the patients. 

As a preliminary analysis, Rijmen et al. (2007) used the univariate version of 
the LM model without covariates. They treated each person by day combination 
as a separate case assuming that the data stemming from different days were 
independent and that the parameters were constant over days. On the basis of 
such model the authors selected four latent states. The first state is interpreted 
as positive mood, the third as negative mood, the second as low intensity for all 
emotions except tension, and the fourth as neutral to moderately positive mood. 
According to the estimated transition matrix there is high persistence in the same 
state. The probability of moving from state 1 to state 2 is 0.14, from state 3 to 4 is 
0.18 and from state 4 to 3 is 0.14 . They noted that there is an indirect transition 
from state 3 to state 1 via the emotionally more neutral state 4. Over the days 
there is an increase of the marginal probabilities of states 1 and 4 indicating that 
the mood of patients tends to become better later on in the day. 

Rijmen et al. (2007) also used a hierarchical LM model by introducing a latent 
variable at day level to account for the fact that data stemming from different 
days are not independent. They modeled the transition between latent states 
at day and signal levels by a first-order time homogeneous Markov chain. They 
estimated a model with two states at the day level and two signal states within 
each day-state. For the first day state, the signal state 1 is characterized by high 
probabilities of experiencing positive emotions and low probabilities of negative 
emotions. Therefore this state is interpreted as positive mood. Instead the second 
day state is interpreted as negative mood. In the signal state 2 positive emotions 
are not well separated from negative emotions and the state is considered as an 
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emotionally neutral to moderately positive state. A tendency to experience more 
positive emotions emerges from the estimated initial and conditional probabilities 
of the chain over day and signal. 

7.7 Dataset on student math achievement 

A multilevel version of the latent Markov model was applied by Bartolucci 
et al. (2010) to analyze how the cognitive math achievement changes over time 
of schooling. Multilevel models are standard tools for the analysis of such kind of 
data as they allow to explicitely consider a hierarchical structure such as students 
nested in classes and schools. 

The data derived from the repeated administration of test items to students 
attending public and non-public middle schools in an Italian Region. A set of 
dichotomously scored items was administered at the end of each of the three 
years of school (28 at the end of the first year, 30 at the end of the second, 
and 39 at the end of the third year) to 1,246 students who progressed from 
Grade 6 to Grade 8. They are from 13 public and 7 non-public middle schools. 
The recorded social background characteristics of the students are father's and 
mother's education. The school characteristics are: number of students, number 
of teachers, students-teachers ratio, and years since school opened. 

The main issue concerns the evaluation of the performance of schools, mea- 
sured in terms of achievement attained by pupils at the end of the period of 
formal schooling. The proposed approach takes into account imperfect persis- 
tence of achievement, heterogeneity in learning and measurement error in test 
scores contributing to get an unbiased estimate of the added value of public and 
non-public schools. 

On these data the authors fitted a multilevel LM model for multivariate data 
with a structure similar to that presented in Section 5. On the basis of the BIC, 
they selected the model based on m = 4 support points for the latent variable 
at cluster level and k = 6 states for the latent Markov chain at individual level. 
Each latent state is associated to an estimated the ability level, so that these 
states can easily identify the group of the most proficient students and that of 
the least proficient students. The ordered estimated conditional probabilities for 
each latent class and each set of items administered at each grade show that the 
difficulty of the items is increasing over time and they allow to identify the items 
which are tailored to distinguish less capable students. 

The estimates of the intercepts and the regression coefficients for the logistic 
model at cluster level based on parametrization (32) allowed interpretation of the 
four cluster latent classes. The class of cluster D is mainly characterized by the 
non-public schools with less than 18 years since school opened and with a ratio 
between students and teachers smaller than 8. As the value of the covariate "ratio 
between student and teachers" increases there is more chance that the class is of 
type A rather than of type B. As the value of the year of activity of the school 
increases there is more chance that the class is of type A rather than of type B. 

The estimated parameters of the global logits defined on the initial and transi- 
tion probabilities are interpreted on the basis of formulae (33) and (34). According 
to them the classes of those schools belonging to type B are less helpful in increas- 
ing math ability in the first year of the middle school, as compared to the classes 
of the other clusters. The same classes contribute less also from Grade 6 to Grade 
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7 but they contribute the most to student's math ability from Grade 7 to 8. The 
ability of the students increases for those having highly educated fathers. The 
magnitude of this increase is stronger at Grade 6 compared to the other grades. 
They identify how the covariates related to each class affect the probability that 
this class is of type A, B, C or D. The most interesting conclusion is that classes 
of students in public schools are mostly of type A (32%), B (23%), and C (38%), 
whereas those in non-public schools are only of type A (79%) and D (21%). The 
estimated regression coefficients related to the individual covariates show that 
the ability of the students increases for those having higher educated fathers. 

8. CONCLUSIONS AND FURTHER DEVELOPMENTS 

In this paper we presented a review of the latent Markov model which starts 
with the basic model of Wiggins (1973), for both univariate and multivariate data. 
We also illustrated extensions to the inclusion of covariates and to multilevel data, 
and outlined maximum likelihood estimation and related inferential methods. We 
tried to keep the presentation simple and provide references to methods which 
are of easy implementation, taking ideas from methods widely applied in the 
literature on hidden Markov models for time series. Moreover, in order to retain 
a simple presentation we dealt with the case of balanced data in which all subjects 
are observed at the same number of occasions, the same response variables are 
obtained at every occasion, and there are no missing responses. 

The potentialities of the LM approach are illustrated by summarizing the re- 
sults of several applications available in the literature. It is worth noting that 
the number of applications available in the literature is growing; recent applica- 
tions also involve situations which are not typically those of longitudinal data. 
Examples concern the estimation of closed-populations on the basis of capture- 
recapture data (Bartolucci and Pennoni, 2007) and the assessment of the peer 
review process of grant applications (Bornmann et al., 2008). 

In this section we would like to summarize further developments of the LM 
models that, in our opinion, deserve particular attention: 

• More flexible parametrizations for the conditional distribution of the re- 
sponse variables given the latent process: We refer in particular to param- 
eterizations which address the problem of multidimensionality when the 
model is applied, for instance, to data coming from the administration of 
test items assessing different types of ability. This is a problem well known 
in the Item Response Theory, which may concern different fields of appli- 
cations such as psychology and assessment of the quality of life. Flexible 
parameterizations are also of interest when the response variables are of 
mixed nature; for instance, we may have a mixture of binary, ordinal, and 
continous variables. 

• Higher-order Markov chains: the assumptions that the latent Markov chain 
if of first-order may be restrictive when there is "memory effect" such that 
the latent state of a subject at a given occasion depends on two or more 
previous occasions. The extension to second or higher-order Markov chains 
is rather simple to implement and has been already applied in certain con- 
texts. However, to our knowledge, a systematic illustration of this higher- 
order approach does not seem to be available in the literature. 
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• Missing responses: It is well known that longitudinal datasets typically suf- 
fer from the problem of missing responses due to several reasons. In apply- 
ing an LM model, a trivial solution to this problem consists in eliminating 
from the sample subjects with at least one missing response. Obviously, this 
solution may lead to a strong bias of the parameter estimates when the as- 
sumption that these responses are missing at random is not plausible. The 
interest here is in formulating LM models in which the event of missing 
response is explicitly modeled given the latent state and the evolution of 
the latent process takes into account this possibility by adopting a suitable 
parametrization of the transition matrices. 

• Multilevel data with evolution at cluster level: We presented in Section 5 a 
multilevel extension of the LM model in which only one latent variable is 
used to represent the effect of each cluster. However, this effect may have 
its own dynamics as the effect at individual level. Hence, of interest is the 
extension in which a sequence of occasion-specific latent variables is used to 
model the effect of each cluster. The resulting model may then be formu- 
lated by assuming the existence of two nested latent Markov chains. Given 
the chain at cluster level, another chain is associated to each individual 
in the cluster. This model if of particular interest, but it seems difficult 
to provide a simple method for maximum likelihood estimation, given the 
difficulty of obtaining the manifest distribution of the response variables. 
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